Bike sharing systems are a means of renting bicycles where the process of obtaining membership, rental, and bike return is automated via a network of kiosk locations throughout a city. Using these systems, people are able to rent a bike from one location and return it to a different place on an as-needed basis. Currently, there are over 500 bike-sharing programs around the world.
The data generated by these systems makes them attractive for researchers because the duration of travel, departure location, arrival location, and time elapsed is explicitly recorded. Bike sharing systems therefore function as a sensor network, which can be used for studying mobility in a city. In this competition, participants are asked to combine historical usage patterns with weather data in order to forecast bike rental demand in the Capital Bikeshare program in Washington, D.C.
Import Apache Spark 2.1 libraries and set up environment variables. For cluster deployment you should probably remove this cell.
import os
import sys
import time
# Change to path where apache spark 2.x is downloaded
SPARK_PATH = '/users/suchy/Documents/spark-2.1.0-bin-hadoop2.7'
os.environ['SPARK_HOME'] = SPARK_PATH
SPARK_HOME = os.environ['SPARK_HOME']
#Add the following paths to the system path.
sys.path.insert(0,os.path.join(SPARK_HOME,"python"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","pyspark.zip"))
sys.path.insert(0,os.path.join(SPARK_HOME,"python","lib","py4j-0.10.4-src.zip"))
Initialize the Spark context for the driver program.
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext
conf = (SparkConf()
.setAppName("Capital_Bike_weather_mining")
.set("spark.executor.memory", "4g")
.set("spark.driver.memory", "8g"))
#.setMaster("yarn") - for cluster deployment
sc = SparkContext.getOrCreate(conf = conf)
sqlContext = SQLContext(sc)
Import pandas, numpy, matplotlib, and seaborn. Then set %matplotlib inline.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kstest
from sklearn import preprocessing
from scipy.stats import skew
from scipy.stats import boxcox
from scipy.special import inv_boxcox
import pickle
%matplotlib inline
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 200)
We are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. We must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period.
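The preparation code later in the notebook notes that the evaluation metric is RMSLE; a minimal NumPy sketch of that metric:

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error for non-negative targets."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

print(rmsle([10, 20, 30], [10, 20, 30]))  # 0.0 for a perfect prediction
```

Minimizing squared error on log-transformed targets, which the data preparation below sets up, is equivalent to minimizing RMSLE on the original counts.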
Read in the customers' bike rental CSV train and test files as Pandas DataFrames.
train_df = pd.read_csv("../data/train.csv")
test_df = pd.read_csv("../data/test.csv")
There are 3 dependent variables: casual, registered, and count (casual + registered).
We will try both approaches: predicting the number of casual and registered rentals separately, and predicting their sum (count).
train_df.head()
test_df.head()
Check out the rental data with the info() and describe() methods. There are 10886 entries in the training data and 6493 entries in the test data; no column has missing values.
train_df.info()
test_df.info()
continuous_var = ['temp', 'atemp', 'humidity', 'windspeed']
# exclude binary variables
categorical_var=['season','weather','year','month','hour','dayofweek','weekofyear','quarter', 'temp_cat']
dependent_variables = ['casual', 'registered', 'count']
dependent_variables_log = ['casual_log', 'registered_log', 'count_log']
dependent_variables_bc = ['casual_bc', 'registered_bc', 'count_bc']
# dependent_variables_cat used for ChiSqSelector, which works only for classified labels
dependent_variables_cat = ['casual_cat', 'registered_cat', 'count_cat']
# categories from set_hour_cat() function
hour_cat_list = [0,1,2,3,4,5,6,7]
added_cat_var = ['season_cat', 'month_cat', 'dayofweek_cat', 'weather_cat']
seasonDict = {1: "Winter", 2 : "Spring", 3 : "Summer", 4 :"Fall" }
monthDict = {1: "January", 2 : "February", 3 : "March", 4 :"April" ,\
5 : "May", 6 : "June", 7 : "July", 8 :"August" ,\
9: "September", 10 : "October", 11 : "November", 12 :"December"}
dayofweekDict = {0: "Monday", 1 : "Tuesday", 2 : "Wednesday", 3 :"Thursday" ,\
4 : "Friday", 5 : "Saturday", 6 : "Sunday"}
weatherDict = {1: "Clear", 2 : "Mist", 3 : "Light_Snow", 4 :"Heavy_Rain" }
Data preparation: parse the datetime stamp and add dummy and derived variables. For data including categorical variables with different numbers of levels, random forests are biased in favor of attributes with more levels, so it's a good idea to keep categorical variables at a similar number of unique values.
Many of the variables created in the function below are motivated by the Exploratory Data Analysis in the next section, such as the rental peak hours.
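A quick illustration of the `pct_change`-based derived variables created below, on a toy series rather than the rental data:

```python
import numpy as np
import pandas as pd

s = pd.Series([10.0, 20.0, 10.0])
change = s.pct_change(periods=1) * 100  # percent change vs. the previous row
print(change.tolist())  # first entry is NaN: no previous value to compare with
```

With `periods=24` the same call compares each hour against the same hour the previous day, which is how the `_prev_day_change_pct` columns are built.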
def set_derivated_vars(dataset):
customers_rental = dataset.copy()
# create derived variables: hour-to-hour and day-to-day change for continuous variables
for var in continuous_var:
customers_rental[var+'_prev_hour_change_pct'] = customers_rental[var].pct_change(periods=1) * 100
customers_rental[var+'_prev_day_change_pct'] = customers_rental[var].pct_change(periods=24) * 100
customers_rental['weather_prev_hour_change_pct'] = customers_rental['weather'].pct_change(periods=1) * 100
customers_rental['weather_prev_day_change_pct'] = customers_rental['weather'].pct_change(periods=24) * 100
customers_rental['atemp_temp_diff'] = customers_rental['atemp'] - customers_rental['temp']
# the first day has no pct_change, and some divisions by zero produce inf
customers_rental = customers_rental.replace([np.inf, -np.inf], np.nan)
# Replace NaN by interpolating; the first row can't be NaN for interpolation to work.
# In the pandas implementation, NaNs beyond the valid range
# will be interpolated as well (like extrapolation)
customers_rental.iloc[0] = customers_rental.iloc[0].fillna(customers_rental.mean())
customers_rental = customers_rental.interpolate(method='time')
customers_rental = set_temp_cat(customers_rental)
return customers_rental
def set_temp_cat(dataset):
customers_rental = dataset.copy()
# create additional variable about rental traffic by temperature;
# the specific ranges are taken from the Exploratory Data Analysis clustermap at the end of the next section
customers_rental['temp_cat'] = 0
customers_rental.loc[(customers_rental['temp'].round() <= 12), 'temp_cat'] = 1
customers_rental.loc[(customers_rental['temp'].round() >= 13) & (customers_rental['temp'].round() <= 30) & \
(customers_rental['workingday'] == 0), 'temp_cat'] = 2
customers_rental.loc[(customers_rental['temp'].round() >= 13) & (customers_rental['temp'].round() <= 30) & \
(customers_rental['workingday'] == 1),'temp_cat'] = 3
customers_rental.loc[(customers_rental['temp'].round() >= 31), 'temp_cat'] = 4
return customers_rental
def set_hour_cat(dataset):
customers_rental = dataset.copy()
# create additional variable about rental traffic by hour,
# specific hours are taken from the Exploratory Data Analysis clustermap at the end of the next section
customers_rental['hour_cat'] = -1
# workingday = 1
customers_rental.loc[(customers_rental['hour'] <= 5) & (customers_rental['workingday'] == 1), 'hour_cat'] = 0
customers_rental.loc[((customers_rental['hour'] == 6) | (customers_rental['hour'] == 10) | \
(customers_rental['hour'] == 11) | (customers_rental['hour'] ==22) | \
(customers_rental['hour'] == 23)) & (customers_rental['workingday'] == 1), 'hour_cat'] = 1
customers_rental.loc[(customers_rental['hour'] >= 7) & (customers_rental['hour'] <= 21) & \
(customers_rental['workingday'] == 1), 'hour_cat'] = 2
customers_rental.loc[((customers_rental['hour'] == 8) | (customers_rental['hour'] == 17) | \
(customers_rental['hour'] == 18)) & (customers_rental['workingday'] == 1), 'hour_cat'] = 3
# workingday = 0
customers_rental.loc[((customers_rental['hour'] >= 22) | (customers_rental['hour'] <= 1) | \
(customers_rental['hour'] == 8)) & (customers_rental['workingday'] == 0), 'hour_cat'] = 4
customers_rental.loc[((customers_rental['hour'] >= 2) & (customers_rental['hour'] <= 7)) & \
(customers_rental['workingday'] == 0), 'hour_cat'] = 5
customers_rental.loc[((customers_rental['hour'] == 9) | (customers_rental['hour'] == 20) | \
(customers_rental['hour'] == 21)) & (customers_rental['workingday'] == 0), 'hour_cat'] = 6
customers_rental.loc[((customers_rental['hour'] >= 10) & (customers_rental['hour'] <= 19)) & \
(customers_rental['workingday'] == 0), 'hour_cat'] = 7
return customers_rental
def set_independent_weather(dataset):
customers_rental = dataset.copy()
#############################################
# data taken from independent source weather
# https://www.washingtonpost.com/blogs/capital-weather-gang/post/2011-in-washington-dc-warm-with-extreme-weather-aplenty/2011/12/27/gIQAnNXSMP_blog.html?utm_term=.7a2584169af6
# https://www.washingtonpost.com/blogs/capital-weather-gang/post/top-5-dc-weather-events-of-2012/2012/12/28/d384311c-4f0e-11e2-950a-7863a013264b_blog.html?utm_term=.1a10a3265c33
#############################################
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 4) \
& (customers_rental['day'] == 15), "workingday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 4) \
& (customers_rental['day'] == 16), "workingday"] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 4) \
& (customers_rental['day'] == 15), "holiday"] = 0
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 4) \
& (customers_rental['day'] == 16), "holiday"] = 0
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 11) \
& (customers_rental['day'] == 25), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 11) \
& (customers_rental['day'] == 23), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 11) \
& (customers_rental['day'] == 25), "workingday"] = 0
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 11) \
& (customers_rental['day'] == 23), "workingday"] = 0
customers_rental.loc[(customers_rental['month'] == 12) \
& ((customers_rental['day'] == 24) | \
(customers_rental['day'] == 26)), "workingday"] = 0
customers_rental.loc[(customers_rental['month'] == 12) \
& ((customers_rental['day'] == 24) | \
(customers_rental['day'] == 25) | \
(customers_rental['day'] == 26)), "holiday"] = 1
# 2011
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 1) \
& (customers_rental['day'] == 26), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 9) \
& (customers_rental['day'] >= 4) & (customers_rental['day'] <= 8), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 8) \
& (customers_rental['day'] == 28), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] == 4) \
& (customers_rental['day'] >= 27) & (customers_rental['day'] <= 28), "holiday"] = 1
# 2012
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 10) \
& (customers_rental['day'] == 30), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 3) \
& (customers_rental['day'] == 2), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 6) \
& (customers_rental['day'] == 1), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 6) \
& (customers_rental['day'] == 29), "holiday"] = 1
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] == 5) \
& (customers_rental['day'] == 21), "holiday"] = 1
return customers_rental
def prepare_data(dataset):
customers_rental = dataset.copy()
dt = pd.DatetimeIndex(customers_rental['datetime'])
customers_rental.set_index(dt, inplace=True)
customers_rental['date'] = pd.to_datetime(customers_rental['datetime'])
customers_rental["year"] = customers_rental["date"].dt.year
customers_rental["month"] = customers_rental["date"].dt.month
customers_rental["day"] = customers_rental["date"].dt.day
customers_rental["hour"] = customers_rental["date"].dt.hour
customers_rental["dayofweek"] = customers_rental["date"].dt.dayofweek
customers_rental["weekofyear"] = customers_rental["date"].dt.weekofyear
customers_rental["season_cat"] = customers_rental["season"].map(seasonDict)
customers_rental["month_cat"] = customers_rental["month"].map(monthDict)
customers_rental["dayofweek_cat"] = customers_rental["dayofweek"].map(dayofweekDict)
customers_rental["weather_cat"] = customers_rental["weather"].map(weatherDict)
customers_rental['quarter'] = 0
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] > 0), 'quarter'] = 1
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] > 3), 'quarter'] = 2
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] > 6), 'quarter'] = 3
customers_rental.loc[(customers_rental['year'] == 2011) & (customers_rental['month'] > 9), 'quarter'] = 4
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] > 0), 'quarter'] = 5
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] > 3), 'quarter'] = 6
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] > 6), 'quarter'] = 7
customers_rental.loc[(customers_rental['year'] == 2012) & (customers_rental['month'] > 9), 'quarter'] = 8
# from EDA
customers_rental['bike_season'] = 0
customers_rental.loc[(customers_rental['month'] >= 4) & (customers_rental['month'] <= 11), 'bike_season'] = 1
# create derived variables, used among others in outlier detection
customers_rental = set_derivated_vars(customers_rental)
customers_rental = set_hour_cat(customers_rental)
customers_rental = set_independent_weather(customers_rental)
# train data has days 1-19 of each month, test data > 20
customers_rental['train'] = 0
customers_rental.loc[(customers_rental['day'] < 20), 'train'] = 1
# The Apache Spark solver minimizes a squared loss function.
# Our evaluation metric will be RMSLE, so it's a good idea to use a logarithmic / Box-Cox
# transformation of the dependent columns (adding 1 first so that 0 values don't become -inf).
#
# dependent_variables_cat is used for ChiSqSelector, which works only on classified labels.
# The lambda dictionary is necessary for the inverse Box-Cox transformation.
for col in dependent_variables:
# we don't have labels for test data so set those columns to zero
customers_rental.loc[(customers_rental['day'] >= 20), col] = 0
customers_rental['%s_log' % col] = np.log(customers_rental[col] + 1)
# initialize columns for the Box-Cox transformation
customers_rental['%s_bc' % col] = 0
customers_rental['%s_cat' % col] = 0
bc_lambda_dict = {}
for hour_cat in hour_cat_list:
lambdas = {}
for col in dependent_variables:
# Box-Cox transformation per hour_cat subset, to get a more normal-like distribution in each subset;
# subsets are split by the similarity of the hour's impact on bike rentals
bc = boxcox(customers_rental[(customers_rental['hour_cat'] == hour_cat) & \
(customers_rental['train'] == 1)][col] +1)
# bc[0] has transformed data, bc[1] has used lambda
customers_rental.loc[(customers_rental['hour_cat'] == hour_cat) & (customers_rental['train'] == 1), \
'%s_bc' % col] = bc[0]
customers_rental.loc[(customers_rental['hour_cat'] == hour_cat) & (customers_rental['train'] == 1), \
'%s_cat' % col] = pd.cut(customers_rental[(customers_rental['hour_cat'] == hour_cat) & \
(customers_rental['train'] == 1)]['%s_bc' % col],\
bins=64, labels=range(0,64), include_lowest=True)
lambdas[col] = bc[1]
bc_lambda_dict[hour_cat] = lambdas
# drop unnecessary variables
customers_rental.drop(['datetime','date'], axis=1, inplace=True)
return customers_rental, bc_lambda_dict
We will use the dataset with dummy variables for linear regression, and the dataset without dummy variables for random forest and gradient-boosted regression. bc_lambda_dict holds the lambdas used in the Box-Cox transformation, which are needed for the inverse Box-Cox transformation.
all_no_dummy, bc_lambda_dict = prepare_data(pd.concat([train_df, test_df]))
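The lambdas in `bc_lambda_dict` are what make the transformation invertible. A round-trip sketch with `scipy.special.inv_boxcox` on toy counts (not the rental data itself):

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

counts = np.array([1.0, 5.0, 20.0, 100.0])
transformed, lmbda = boxcox(counts + 1)         # +1 shift, as in prepare_data
recovered = inv_boxcox(transformed, lmbda) - 1  # invert with the stored lambda
print(np.allclose(recovered, counts))           # True
```

Predictions made on the Box-Cox scale are mapped back to rental counts the same way, using the lambda stored per hour_cat.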
Column structure (without dummies) after the data preparation step. Later we will explore which subset of these columns to use in the machine learning models.
all_no_dummy.info()
all_no_dummy.describe()
Let's explore the data!
sns.set_palette("coolwarm")
sns.set_style('whitegrid')
The medians of the continuous variables look reasonable. There are many outliers in the windspeed variable and some zero-humidity outliers.
fig,(ax1,ax2)= plt.subplots(ncols=2, figsize=(12,4))
sns.boxplot(data=all_no_dummy[continuous_var], ax=ax1)
sns.boxplot(data=all_no_dummy[all_no_dummy['train'] == 1][dependent_variables], ax=ax2)
Let's look at the distribution curves of the dependent variables. Most machine learning algorithms perform best on normally distributed data. The raw distributions of the dependent variables look more Poisson than Gaussian. A log transformation helps a lot in converting the data to a normal-like distribution, but the Box-Cox method does the best job by automatically choosing the transformation parameter (lambda).
fig,axes= plt.subplots(nrows=3, ncols=2, figsize=(11,10))
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['count'], bins = 24, ax=axes[0][0])
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['count_log'], bins = 24, ax=axes[0][1], color='blue')
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['casual'], bins = 24, ax=axes[1][0])
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['casual_log'], bins = 24, ax=axes[1][1], color='blue')
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['registered'], bins = 24, ax=axes[2][0])
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['registered_log'], bins = 24, ax=axes[2][1], color='blue')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout()
We can see that the log distributions look more normal-like, but they are clearly skewed. Box-Cox will help us deal with that. First, though, let's examine the raw data divided into the 8 hour categories.
for label in dependent_variables:
color='orange'
fig,axes= plt.subplots(nrows=2, ncols=4, figsize=(15,10))
if (label == 'registered'): color='red'
elif (label == 'count'): color='green'
for col in range(0,8):
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == col) & (all_no_dummy['train'] == 1)][label], bins = 24, ax=axes[(col // 4)][(col % 4)], axlabel=label + ', HOUR_CAT = ' + str(col),color=color)
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout()
The data are highly skewed. Time to explore the Box-Cox transformation.
for label in dependent_variables_bc:
color='orange'
fig,axes= plt.subplots(nrows=2, ncols=4, figsize=(15,10))
if (label == 'registered_bc'): color='red'
elif (label == 'count_bc'): color='green'
for col in range(0,8):
sns.distplot(all_no_dummy[(all_no_dummy['hour_cat'] == col) & (all_no_dummy['train'] == 1)][label], bins = 24, ax=axes[(col // 4)][(col % 4)], axlabel=label + ', HOUR_CAT = ' + str(col),color=color)
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout()
Right now only the casual rentals in hour_cat = 0 (0 a.m. to 5 a.m., workingday = 1) and hour_cat = 5 (10 p.m. to 1 a.m. and 8 a.m., workingday = 0) are skewed toward zero. The rest of the distributions look good. Let's test them with the Kolmogorov-Smirnov test: a p-value below 0.05 rejects the hypothesis that the distribution is Gaussian.
for col in dependent_variables_bc:
print("Kolmogorov-Smirnov Test for " + col)
for hour_cat in hour_cat_list:
print("Hour_cat " + str(hour_cat) + ": " + str(kstest(all_no_dummy[(all_no_dummy['hour_cat'] == hour_cat) & (all_no_dummy['train'] == 1)][col], 'norm')))
print()
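One caveat worth keeping in mind: `scipy.stats.kstest(x, 'norm')` compares against the standard normal N(0, 1), so a raw series is usually standardized first (or the fitted mean and std passed via `args`). A small sketch on synthetic data:

```python
import numpy as np
from scipy.stats import kstest

rng = np.random.RandomState(0)
x = rng.normal(loc=50.0, scale=10.0, size=500)  # Gaussian, but not standard

# Against N(0, 1) directly: rejected despite being Gaussian.
p_raw = kstest(x, 'norm').pvalue
# Standardized first: not rejected.
p_std = kstest((x - x.mean()) / x.std(), 'norm').pvalue
print(p_raw < 0.05, p_std > 0.05)
```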
All distributions passed the test; now let's compute the skew of each distribution. The closer to zero, the better.
for col in dependent_variables:
print("Distribution skew for " + col)
for hour_cat in hour_cat_list:
print("Hour_cat " + str(hour_cat) + ": " + str(skew(all_no_dummy[(all_no_dummy['hour_cat'] == hour_cat) & (all_no_dummy['train'] == 1)]['%s_bc' % col])))
print()
Let's compare those skews with the raw data.
for col in dependent_variables:
print("Distribution skew for " + col)
for hour_cat in hour_cat_list:
print("Hour_cat " + str(hour_cat) + ": " + str(skew(all_no_dummy[(all_no_dummy['hour_cat'] == hour_cat) & (all_no_dummy['train'] == 1)][col])))
print()
Looks like we made a big improvement. Good work.
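To see the effect in isolation, a toy sketch on right-skewed synthetic data (exponential draws standing in for rental counts; not the actual data):

```python
import numpy as np
from scipy.stats import boxcox, skew

rng = np.random.RandomState(42)
raw = rng.exponential(scale=50.0, size=2000)  # heavily right-skewed

transformed, lmbda = boxcox(raw + 1)          # +1 shift, as in prepare_data
print(round(float(skew(raw)), 2), round(float(skew(transformed)), 2))
```

The skew of the raw draws is around 2 and drops close to zero after the transformation, mirroring what we saw per hour_cat above.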
Explore daily averages of hourly bike rentals across 2011 and 2012. The sliding window spans 7 days for the dependent variables and 3 days for the continuous independent variables. The default threshold is a 75 percent change, computed as (a - b) / b.
outliers = {}
def plot_with_window(column, dataset=all_no_dummy, all=True, threshold=75):
if (all):
day_avg = 3
data = dataset
data_window_mean = data.rolling(window=day_avg * 24).mean()
data_window_std = data.rolling(window=day_avg * 24).std()
title = column + ' over time'
else:
day_avg = 7
data = dataset[dataset['train'] == 1].groupby(['year','month','day']).mean()
data_window_mean = data.rolling(window=day_avg).mean()
data_window_std = data.rolling(window=day_avg).std()
title = 'Daily mean of ' + column + ' over time'
diff = np.abs(data[column] - data_window_mean[column])
# diff for outlier candidates must also exceed the threshold, to avoid false positives
outliers = diff[(diff > 3 * data_window_std[column]) & (diff > threshold)]
plt.figure(figsize=(9,4))
data[column].plot()
data_window_mean[column].plot(label=str(day_avg) + ' Day Avg', lw=1, ls='--', c='red')
plt.legend(loc='best')
plt.title(title)
plt.tight_layout()
return outliers
def unique_outliers(col_list):
outliers_list = []
for col in col_list:
for timestamp in outliers[col].index:
outliers_list.append(timestamp)
return sorted(list(set(outliers_list)))
The casual rental series is highly variable, with many positive peaks.
for col in dependent_variables:
outliers[col] = plot_with_window(col, all=False)
The plot for registered users looks almost the same because, on average, about 80% of total rentals are made by registered users.
all_no_dummy[all_no_dummy['train'] == 1][['casual', 'registered', 'count']].describe().loc['mean']
There are no outliers for the dependent variables (by our criteria).
all_no_dummy.loc[unique_outliers(dependent_variables)]
Let's find out whether there are some weather data outliers.
temp_var_list = ['temp',
'temp_prev_hour_change_pct',
'temp_prev_day_change_pct']
for col in temp_var_list:
outliers[col] = plot_with_window(col)
Some of the outliers are shown below.
all_no_dummy.loc[unique_outliers(temp_var_list), temp_var_list].head()
atemp_var_list = ['atemp',
'atemp_prev_hour_change_pct',
'atemp_prev_day_change_pct']
for col in atemp_var_list:
outliers[col] = plot_with_window(col)
all_no_dummy.loc[unique_outliers(atemp_var_list)][atemp_var_list].head()
outliers['atemp_temp_diff'] = plot_with_window('atemp_temp_diff', threshold=10)
all_no_dummy.loc[unique_outliers(['atemp_temp_diff'])]['atemp_temp_diff'].head()
We can see one clear outlier in humidity. On the day with the minimum of the count variable (2011-03-06), humidity had one of its highest values. That matches our understanding: there are fewer bike rentals during rain. On 2011-03-10 humidity has a zero daily mean although the adjacent days had high humidity values; this looks like missing values.
all_no_dummy.groupby(['year','month','day']).mean().idxmin()
humidity_var_list = ['humidity',
'humidity_prev_hour_change_pct',
'humidity_prev_day_change_pct']
for col in humidity_var_list:
outliers[col] = plot_with_window(col)
all_no_dummy.loc[unique_outliers(humidity_var_list)][humidity_var_list].head()
Windspeed, by its nature, has a very irregular profile, but it oscillates around a fairly stable mean level. It looks like this variable doesn't add value to our predictive models.
windspeed_var_list = ['windspeed',
'windspeed_prev_hour_change_pct',
'windspeed_prev_day_change_pct']
for col in windspeed_var_list:
outliers[col] = plot_with_window(col)
all_no_dummy.loc[unique_outliers(windspeed_var_list)][windspeed_var_list].head()
len(unique_outliers(humidity_var_list + temp_var_list + atemp_var_list + windspeed_var_list + ['atemp_temp_diff']))
def interpolate_data(dataset):
data = dataset.copy()
for col_list in [humidity_var_list] + [temp_var_list] + [atemp_var_list] + [windspeed_var_list]:
for col in col_list:
for timestamp in outliers[col].index:
data.loc[timestamp, col_list] = np.NaN
# avoid first row NaN
data.iloc[0] = data.iloc[0].fillna(data.mean())
data = data.interpolate(method='time')
# after interpolating continuous variables we must recompute the derived variables
data = set_derivated_vars(data)
data = set_temp_cat(data)
return data
all_no_dummy_interpolate = interpolate_data(all_no_dummy.copy())
The difference between the original and interpolated data:
(all_no_dummy != all_no_dummy_interpolate).sum()
An example: at 2012-01-10 05:00:00 there were outliers for temp (16.40) and atemp (20.46). Those values were interpolated to 12.77 and 15.26 respectively.
all_no_dummy.loc[('2012-01-10 02:00:00'):('2012-01-10 07:00:00')][['temp', 'atemp']]
all_no_dummy_interpolate.loc[('2012-01-10 02:00:00'):('2012-01-10 07:00:00')][['temp', 'atemp']]
From the boxplot below it's clear that most bike rentals happen at hours 8, 17, and 18. We can expect regular registered users to rent bikes at those hours, since there are almost no outliers there. The hours in between show many outliers; we can expect those to come mostly from casual users.
plt.figure(figsize=(10,6))
sns.boxplot(x="hour", y="count", data=all_no_dummy[all_no_dummy['train'] == 1])
plt.tight_layout()
The plots below show the rental distribution per month. Demand for bikes is highest during summer. Across all seasons the hourly rental profile keeps the same shape: on working days there are two peaks, at 8 and at 17-18. Casual users typically rent bikes between those peaks, as well as on weekends.
# below code is taken from https://www.kaggle.com/viveksrinivasan/eda-ensemble-model-top-10-percentile
fig,(ax1,ax2,ax3,ax4)= plt.subplots(nrows=4, figsize=(10,20))
#fig.set_size_inches(12,15)
sortOrder = ["January","February","March","April","May","June","July","August","September","October","November","December"]
hueOrder = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
monthAggregated = pd.DataFrame(all_no_dummy[all_no_dummy['train'] == 1].groupby("month_cat")["count"].mean()).reset_index()
monthSorted = monthAggregated.sort_values(by="count",ascending=False)
sns.barplot(data=monthSorted,x="month_cat",y="count",ax=ax1,order=sortOrder)
ax1.set(xlabel='Month', ylabel='Average Count',title="Average Count By Month")
hourAggregated = pd.DataFrame(all_no_dummy[all_no_dummy['train'] == 1].groupby(["hour","season_cat"],sort=True)["count"].mean()).reset_index()
sns.pointplot(x=hourAggregated["hour"], y=hourAggregated["count"],hue=hourAggregated["season_cat"], data=hourAggregated, join=True,ax=ax2)
ax2.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Season",label='big')
hourAggregated = pd.DataFrame(all_no_dummy[all_no_dummy['train'] == 1].groupby(["hour","dayofweek_cat"],sort=True)["count"].mean()).reset_index()
sns.pointplot(x=hourAggregated["hour"], y=hourAggregated["count"],hue=hourAggregated["dayofweek_cat"],hue_order=hueOrder, data=hourAggregated, join=True,ax=ax3)
ax3.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across Weekdays",label='big')
hourTransformed = pd.melt(all_no_dummy[all_no_dummy['train'] == 1][["hour","casual","registered"]], id_vars=['hour'], value_vars=['casual', 'registered'])
hourAggregated = pd.DataFrame(hourTransformed.groupby(["hour","variable"],sort=True)["value"].mean()).reset_index()
sns.pointplot(x=hourAggregated["hour"], y=hourAggregated["value"],hue=hourAggregated["variable"],hue_order=["casual","registered"], data=hourAggregated, join=True,ax=ax4)
ax4.set(xlabel='Hour Of The Day', ylabel='Users Count',title="Average Users Count By Hour Of The Day Across User Type",label='big')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout()
There are no clear linear relationships between the variables, apart from the dependent variables (casual, registered, count) themselves and the temp-atemp pair. This indicates that regression algorithms able to handle non-linearity could perform better than linear regression, or that the data needs to be divided into more linear parts.
sns.pairplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], vars=continuous_var + dependent_variables, hue='holiday', palette='coolwarm')
There is some correlation (shown on the heatmap a few cells further on) between casual and registered rentals.
sns.pairplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], vars=['casual', 'registered'], palette='coolwarm')
There is a clear linear relationship between the temperature in Celsius and the "feels like" temperature in Celsius. We should probably drop one of them to avoid multicollinearity, or use ensemble methods.
sns.jointplot(x='temp',y='atemp',data=all_no_dummy_interpolate)
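A quick numeric companion to the plot. On the real frame this would be `all_no_dummy_interpolate[['temp', 'atemp']].corr()`; the sketch below uses toy values:

```python
import numpy as np

temp = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
atemp = temp * 1.1 + 1.5            # exact linear relation, as the jointplot suggests
r = np.corrcoef(temp, atemp)[0, 1]  # Pearson correlation coefficient
print(round(r, 3))                  # 1.0 for an exact linear relation
```

A coefficient this close to 1 is the usual signal to drop one column of the pair, or to rely on tree ensembles, which tolerate collinearity.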
There are potentially 25 outliers where the absolute difference between the temperature and the "feels like" temperature exceeds 10 degrees Celsius. We could build a separate linear regression model to predict atemp for those rows, but given how few records are involved, we simply interpolate the atemp_temp_diff variable for them.
sns.jointplot(x='temp',y='atemp_temp_diff',data=all_no_dummy_interpolate)
all_no_dummy_interpolate[all_no_dummy_interpolate['atemp_temp_diff'] < -10][continuous_var + ['atemp_temp_diff']]
all_no_dummy_interpolate.loc[all_no_dummy_interpolate['atemp_temp_diff'] < -10 , \
['atemp','atemp_temp_diff']] = np.NaN
all_no_dummy.loc[all_no_dummy['atemp_temp_diff'] < -10 , \
['atemp','atemp_temp_diff']] = np.NaN
all_no_dummy_interpolate = all_no_dummy_interpolate.interpolate(method='time')
all_no_dummy = all_no_dummy.interpolate(method='time')
all_no_dummy_interpolate = set_derivated_vars(all_no_dummy_interpolate)
all_no_dummy = set_derivated_vars(all_no_dummy)
Right now there are no outliers for atemp_temp_diff variable.
all_no_dummy_interpolate[all_no_dummy_interpolate['atemp_temp_diff'] < -10][continuous_var + ['atemp_temp_diff']]
TRAIN data set - categorical variables.
fig,axes= plt.subplots(nrows=4, ncols=2, figsize=(11,10))
sns.countplot(x='holiday',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[0][0])
sns.countplot(x='season_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[0][1])
sns.countplot(x='workingday',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[1][0])
sns.countplot(x='weather_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[1][1])
sns.countplot(x='dayofweek',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[2][0])
sns.countplot(x='month',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[2][1])
sns.countplot(x='hour',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[3][0])
sns.countplot(x='temp_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1], ax=axes[3][1])
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout()
TEST Kaggle data set - categorical variables.
fig,axes= plt.subplots(nrows=4, ncols=2, figsize=(11,10))
sns.countplot(x='holiday',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[0][0])
sns.countplot(x='season_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[0][1])
sns.countplot(x='workingday',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[1][0])
sns.countplot(x='weather_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[1][1])
sns.countplot(x='dayofweek',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[2][0])
sns.countplot(x='month',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[2][1])
sns.countplot(x='hour',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[3][0])
sns.countplot(x='temp_cat',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1], ax=axes[3][1])
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout()
The main difference between the Train and Test categorical variable distributions is a smaller number of rows in February for the test data set. This is because February had 28 days in 2011 and 29 days in 2012.
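As a quick sanity check (not part of the original pipeline), the standard library confirms the February day counts for both years:

```python
import calendar

# number of days in February for the two years covered by the data
feb_2011 = calendar.monthrange(2011, 2)[1]
feb_2012 = calendar.monthrange(2012, 2)[1]
print(feb_2011, feb_2012)  # 28 in 2011, 29 in the 2012 leap year
```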
TRAIN data set - continuous variables.
fig,axes= plt.subplots(nrows=4, ncols=4, figsize=(11,10))
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['temp'], bins=24, ax=axes[0][0], axlabel='TRAIN: temp', color='orange')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1]\
['temp'], bins=24, ax=axes[1][0], axlabel='TRAIN: temp_interpolate', color='orange')
sns.distplot(all_no_dummy[all_no_dummy['train'] != 1]['temp'], bins=24, ax=axes[2][0], axlabel='TEST: temp', color='orange')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1]\
['temp'], bins=24, ax=axes[3][0], axlabel='TEST: temp_interpolate', color='orange')
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['atemp'], bins=24, ax=axes[0][1], axlabel='TRAIN: atemp', color='red')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1]\
['atemp'], bins=24, ax=axes[1][1], axlabel='TRAIN: atemp_interpolate', color='red')
sns.distplot(all_no_dummy[all_no_dummy['train'] != 1]['atemp'], bins=24, ax=axes[2][1], axlabel='TEST: atemp', color='red')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1]\
['atemp'], bins=24, ax=axes[3][1], axlabel='TEST: atemp_interpolate', color='red')
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['humidity'], bins=24, ax=axes[0][2], axlabel='TRAIN: humidity', color='green')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1]\
['humidity'], bins=24, ax=axes[1][2], axlabel='TRAIN: humidity_interpolate', color='green')
sns.distplot(all_no_dummy[all_no_dummy['train'] != 1]['humidity'], bins=24, ax=axes[2][2], axlabel='TEST: humidity', color='green')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1]\
['humidity'], bins=24, ax=axes[3][2], axlabel='TEST: humidity_interpolate', color='green')
sns.distplot(all_no_dummy[all_no_dummy['train'] == 1]['windspeed'], bins=24, ax=axes[0][3], axlabel='TRAIN: windspeed', color='blue')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1]\
['windspeed'], bins=24, ax=axes[1][3], axlabel='TRAIN: windspeed_interpolate', color='blue')
sns.distplot(all_no_dummy[all_no_dummy['train'] != 1]['windspeed'], bins=24, ax=axes[2][3], axlabel='TEST: windspeed', color='blue')
sns.distplot(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1]\
['windspeed'], bins=24, ax=axes[3][3], axlabel='TEST: windspeed_interpolate', color='blue')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout()
We can see that the TRAIN and TEST data sets have very similar distributions for the continuous and nominal variables, so there is no need to select a specific subset (similar to the test data) from the training data to train the models on.
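Beyond eyeballing the histograms, one simple numeric check of distributional similarity is the two-sample Kolmogorov-Smirnov statistic (the largest gap between the two empirical CDFs). A minimal pure-Python sketch on toy samples, illustrative only and not part of the original notebook:

```python
import bisect

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: sup |F_a(x) - F_b(x)|."""
    a, b = sorted(sample_a), sorted(sample_b)
    gaps = []
    for x in a + b:
        # empirical CDF values of each sample at point x
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        gaps.append(abs(cdf_a - cdf_b))
    return max(gaps)

# identical samples -> 0.0; completely disjoint samples -> 1.0
print(ks_statistic([1, 2, 3], [1, 2, 3]))   # 0.0
print(ks_statistic([0, 0, 0], [5, 5, 5]))   # 1.0
```

In practice `scipy.stats.ks_2samp` computes the same statistic plus a p-value; values near 0 for each feature would support the "similar distributions" conclusion above.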
all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].corr().loc[dependent_variables]
A heatmap is a better visualisation tool for inspecting Pearson correlations. There are no strong correlations: the hour and temp variables reach a Pearson correlation of approximately 0.4.
fig = plt.figure(figsize=(20,12))
sns.heatmap(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].drop(dependent_variables + dependent_variables_cat + dependent_variables_log, axis=1).corr(), annot=False,cmap='coolwarm')
plt.tight_layout()
We can see that linear regression can give us only a rough approximation of total rentals from the hour variable.
sns.jointplot(x='hour',y='count',kind='reg',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1])
The temp-count scatter plot is skewed to the right, which indicates that higher temperatures result in more total bike rentals. But again, it's only a rough approximation.
sns.jointplot(x='temp',y='count',kind='reg',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1])
sns.jointplot(x='temp',y='count',kind='hex',data=all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1])
Let's have a closer look at the distribution across different years.
train_no_dummy_year_piv = all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].pivot_table(values='count',index='month',columns='year').rename(index=monthDict)
train_no_dummy_year_piv
It looks like there were many more rentals in 2012 than in 2011. Perhaps there were more bikes available to rent in 2012. Another reason could be that people got used to renting bikes from Capital Bikeshare, so there were more registered users in total.
fig = plt.figure(figsize=(7,5))
sns.heatmap(train_no_dummy_year_piv,cmap='coolwarm')
train_no_dummy_month_piv = all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].pivot_table(values='count',index='hour',columns='month').rename(columns=monthDict)
We can see that the bike season runs from April to November.
fig = plt.figure(figsize=(11,8))
sns.heatmap(train_no_dummy_month_piv,cmap='coolwarm')
Let's cluster the impact of hour and workingday on rentals.
train_no_dummy_workingday_piv = all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].pivot_table(values='count',index='hour',columns=['workingday'])
sns.clustermap(train_no_dummy_workingday_piv,cmap='coolwarm',standard_scale=1,method='average')
all_no_dummy_interpolate[(all_no_dummy_interpolate['train'] == 1) & \
(all_no_dummy_interpolate['workingday'] == 1)].groupby('hour').mean()['count']
all_no_dummy_interpolate[(all_no_dummy_interpolate['train'] == 1) & \
(all_no_dummy_interpolate['workingday'] == 0)].groupby('hour').mean()['count']
We added those categories in the hour_cat variable.
Let's cluster the impact of temp and workingday on rentals.
train_no_dummy_year_temp_piv = all_no_dummy_interpolate[all_no_dummy_interpolate['train'] == 1].round().pivot_table(values='count',index='temp',columns=['workingday']).dropna()
sns.clustermap(train_no_dummy_year_temp_piv,cmap='coolwarm',standard_scale=1,method='average')
We added those categories in the temp_cat variable.
Here are three common evaluation metrics for regression problems:
Mean Absolute Error (MAE) is the mean of the absolute value of the errors:
$$\frac 1n\sum_{i=1}^n|y_i-\hat{y}_i|$$
Mean Squared Error (MSE) is the mean of the squared errors:
$$\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2$$
Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors:
$$\sqrt{\frac 1n\sum_{i=1}^n(y_i-\hat{y}_i)^2}$$
Root Mean Squared Log Error (RMSLE) is the square root of the mean of the squared log errors:
$$\sqrt{\frac 1n\sum_{i=1}^n(\log(y_i+1)-\log(\hat{y}_i+1))^2}$$
Comparing these metrics: all of them are loss functions, because we want to minimize them.
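To make the difference between the metrics concrete, here is a tiny numeric sketch (toy values, plain NumPy, not part of the original analysis). Both predictions miss by the same absolute amount, yet the metrics disagree on how bad that is:

```python
import numpy as np

# toy labels and predictions: both points have an absolute error of 10
y_true = np.array([10.0, 100.0])
y_pred = np.array([20.0, 110.0])

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
rmsle = np.sqrt(np.mean((np.log(y_true + 1) - np.log(y_pred + 1)) ** 2))

# RMSE is 10 regardless of scale; RMSLE weights the relative error,
# so the 10 -> 20 miss dominates the 100 -> 110 miss
print(rmse, rmsle)  # 10.0 and roughly 0.46
```

This relative weighting is why RMSLE is a natural choice for count data like hourly rentals, where demand spans several orders of magnitude.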
def evaluatorMAE_own(predictions, labelCol):
diff = np.abs(predictions[labelCol] - predictions['prediction'])
mean_error = diff.mean()
return mean_error
def evaluatorMSE_own(predictions, labelCol):
diff = predictions[labelCol] - predictions['prediction']
mean_error = np.square(diff).mean()
return mean_error
def evaluatorRMSE_own(predictions, labelCol):
diff = predictions[labelCol] - predictions['prediction']
mean_error = np.square(diff).mean()
return np.sqrt(mean_error)
def evaluatorRMSLE_own(predictions, labelCol):
diff = np.log(predictions[labelCol] + 1) - np.log(predictions['prediction'] + 1)
mean_error = np.square(diff).mean()
return np.sqrt(mean_error)
def evaluatorR2_own(predictions, labelCol):
SS_res = np.sum(np.square(predictions[labelCol] - predictions['prediction']))
SS_tot = np.sum(np.square(predictions[labelCol] - predictions[labelCol].mean()))
R2 = 1 - SS_res/SS_tot
return R2
def undo_log_transform(pred,labelCol, Kaggle=False):
predictions = pred.copy()
# Kaggle test data doesn't have labels
if (not Kaggle):
predictions[labelCol] = np.exp(predictions[labelCol]) - 1
predictions['prediction'] = np.exp(predictions['prediction']) - 1
return predictions
def undo_bc_transform(pred,labelCol, lmbd, Kaggle=False):
predictions = pred.copy()
# Kaggle test data doesn't have labels
if (not Kaggle):
predictions[labelCol] = inv_boxcox(predictions[labelCol], lmbd) - 1
predictions['prediction'] = inv_boxcox(predictions['prediction'], lmbd) - 1
return predictions
def evaluateMetrics(preds, labelCol, lmbd, rollback=True, rounded=True):
predictions = preds.copy()
if (rollback):
predictions = undo_bc_transform(predictions, labelCol, lmbd)
if (rounded):
predictions = np.round(predictions)
# print("MAE: " + str('%.4f' % evaluatorMAE_own(predictions, labelCol)))
# print("MSE: " + str('%.4f' % evaluatorMSE_own(predictions, labelCol)))
# print("RMSE: " + str('%.4f' % evaluatorRMSE_own(predictions, labelCol)))
print("R2: " + str('%.4f' % evaluatorR2_own(predictions, labelCol)))
# print("----------------")
print("RMSLE: " + str('%.4f' % evaluatorRMSLE_own(predictions, labelCol)))
# print("----------------")
return evaluatorRMSLE_own(predictions, labelCol)
Drop the unnecessary added category variables from the training dataset. Apache Spark needs all data to be numerical. Later on we will perform feature selection to choose the best variables for our models.
all_no_dummy_interpolate = all_no_dummy_interpolate.drop(added_cat_var, axis=1)
Dummy variables for linear regression.
all_dummy_interpolate = pd.get_dummies(data = all_no_dummy_interpolate.copy(), \
columns = categorical_var, drop_first=True)
Split the training data into train and "test" subsets to evaluate our models. The final model will predict labels for the Kaggle test datasets.
trainingData_dummy = all_dummy_interpolate[all_dummy_interpolate['day'] <= 15]
testData_dummy = all_dummy_interpolate[(all_dummy_interpolate['day'] > 15) & (all_dummy_interpolate['day'] < 20)]
trainingData_no_dummy = all_no_dummy_interpolate[all_no_dummy_interpolate['day'] <= 15]
testData_no_dummy = all_no_dummy_interpolate[(all_no_dummy_interpolate['day'] > 15) & (all_no_dummy_interpolate['day'] < 20)]
Create Apache Spark Data Frames
spark_train_dummy = sqlContext.createDataFrame(trainingData_dummy)
spark_test_dummy = sqlContext.createDataFrame(testData_dummy)
spark_train_no_dummy = sqlContext.createDataFrame(trainingData_no_dummy)
spark_test_no_dummy = sqlContext.createDataFrame(testData_no_dummy)
Prepare Kaggle test data.
# reset the index to include datetime as a column
spark_Kaggle_test_dummy = sqlContext.createDataFrame(all_dummy_interpolate[all_dummy_interpolate['train'] != 1].reset_index())
spark_Kaggle_test_no_dummy = sqlContext.createDataFrame(all_no_dummy_interpolate[all_no_dummy_interpolate['train'] != 1].reset_index())
The Apache Spark machine learning library requires the input data frame to be transformed into labeled points.
from pyspark.ml.linalg import Vectors
DUMMY = True
columns_to_hide = added_cat_var + dependent_variables + dependent_variables_log \
+ dependent_variables_bc + dependent_variables_cat + ['train','index', 'hour_cat']
dummy_cols = [x for x in spark_train_dummy.columns if x not in columns_to_hide]
no_dummy_cols = [x for x in spark_train_no_dummy.columns if x not in columns_to_hide]
def transformToLabeledPoint(row) :
retArray=[]
if (DUMMY):
spark_col = dummy_cols
else:
spark_col = no_dummy_cols
for col in spark_col:
retArray.append(row[col])
hour_cat = row["hour_cat"]
label_count = row["count_bc"]
label_registered = row["registered_bc"]
label_casual = row["casual_bc"]
return hour_cat, label_count, label_registered, label_casual, Vectors.dense(retArray)
def transformToLabeledPointKaggle(row) :
retArray=[]
if (DUMMY):
spark_col = dummy_cols
else:
spark_col = no_dummy_cols
for col in spark_col:
retArray.append(row[col])
hour_cat = row["hour_cat"]
datetime = row["index"]
# labels must match those from transformToLabeledPoint for the later unionAll() operation
return hour_cat, datetime,0,0, Vectors.dense(retArray)
A ChiSqSelector is used for feature selection.
def transformToLabeledPointChiSqSelector(row) :
retArray=[]
if (DUMMY):
spark_col = dummy_cols
else:
spark_col = no_dummy_cols
for col in spark_col:
retArray.append(row[col])
hour_cat = row["hour_cat"]
label_count_cat = row["count_cat"]
label_registered_cat = row["registered_cat"]
label_casual_cat = row["casual_cat"]
return hour_cat, label_count_cat, label_registered_cat, label_casual_cat, Vectors.dense(retArray)
Prepare the Apache Spark data frames for linear regression.
DUMMY = True
trainingData_dummy = sqlContext.createDataFrame(spark_train_dummy.rdd.map(transformToLabeledPoint), ["hour_cat", "label_count", "label_registered", "label_casual", "features_to_filter"])
testData_dummy = sqlContext.createDataFrame(spark_test_dummy.rdd.map(transformToLabeledPoint), ["hour_cat", "label_count", "label_registered", "label_casual", "features_to_filter"])
testKaggle_dummy = sqlContext.createDataFrame(spark_Kaggle_test_dummy.rdd.map(transformToLabeledPointKaggle), ["hour_cat", "label_count", "label_registered", "label_casual", "features_to_filter"])
chiSq_dummy = sqlContext.createDataFrame(spark_train_dummy.unionAll(spark_test_dummy)\
.rdd.map(transformToLabeledPointChiSqSelector), ["hour_cat", "label_count_cat", "label_registered_cat", "label_casual_cat", "features_to_filter"])
Prepare the Apache Spark data frames for random forest and gradient-boosted regression. No dummy variables are needed.
DUMMY = False
trainingData_no_dummy = sqlContext.createDataFrame(spark_train_no_dummy.rdd.map(transformToLabeledPoint), ["hour_cat", "label_count", "label_registered", "label_casual", "features_to_filter"])
testData_no_dummy = sqlContext.createDataFrame(spark_test_no_dummy.rdd.map(transformToLabeledPoint), ["hour_cat", "label_count", "label_registered", "label_casual", "features_to_filter"])
testKaggle_no_dummy = sqlContext.createDataFrame(spark_Kaggle_test_no_dummy.rdd.map(transformToLabeledPointKaggle), ["hour_cat", "label_count", "label_registered", "label_casual","features_to_filter"])
chiSq_no_dummy = sqlContext.createDataFrame(spark_train_no_dummy.unionAll(spark_test_no_dummy)\
.rdd.map(transformToLabeledPointChiSqSelector), ["hour_cat", "label_count_cat", "label_registered_cat", "label_casual_cat", "features_to_filter"])
Cache the Spark data frames in memory for faster computation.
trainingData_dummy.cache()
trainingData_dummy.count()
trainingData_no_dummy.cache()
trainingData_no_dummy.count()
testData_dummy.cache()
testData_dummy.count()
testData_no_dummy.cache()
testData_no_dummy.count()
testKaggle_dummy.cache()
testKaggle_dummy.count()
testKaggle_no_dummy.cache()
testKaggle_no_dummy.count()
To treat categorical variables as nominal we must first index them with a featureIndexer. Features with more unique values than maxCategories will be treated as continuous.
from pyspark.ml.feature import VectorIndexer
featureIndexer =\
VectorIndexer(inputCol="features_to_filter", outputCol="indexedFeatures", maxCategories=4) \
.fit(trainingData_no_dummy.unionAll(testData_no_dummy).unionAll(testKaggle_no_dummy))
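The maxCategories rule above amounts to counting distinct values per feature. A pure-Python toy illustration (hypothetical columns, not the actual data set):

```python
max_categories = 4

# hypothetical toy columns mirroring the kind of data in the notebook
columns = {
    'season': [1, 2, 3, 4, 1, 2],                   # 4 distinct values -> indexed as categorical
    'temp':   [9.8, 14.7, 20.5, 26.2, 31.1, 33.0],  # 6 distinct values -> treated as continuous
}

treated_as_categorical = {name: len(set(values)) <= max_categories
                          for name, values in columns.items()}
print(treated_as_categorical)  # {'season': True, 'temp': False}
```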
Now it's time to train and tune our models on the training data using k-fold cross-validation!
from pyspark.ml import Pipeline
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import ChiSqSelector
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import TrainValidationSplit
def make_prediction(models_dict, feature_selector, hour_cat, dummy=False, Kaggle=False):
selector = feature_selector
# Kaggle test data is used only with no_dummy
if (dummy):
#selector = feature_selector_dummy[hour_cat]
if (Kaggle):
test_data = testKaggle_dummy.filter(testKaggle_dummy.hour_cat == hour_cat)
else:
test_data = testData_dummy.filter(testData_dummy.hour_cat == hour_cat)
elif (Kaggle):
test_data = featureIndexer.transform(testKaggle_no_dummy.filter(testKaggle_no_dummy.hour_cat == hour_cat))
else:
test_data = featureIndexer.transform(testData_no_dummy.filter(testData_no_dummy.hour_cat == hour_cat))
predictionsTestData_r = models_dict['label_registered'].transform(selector['label_registered'].transform(test_data))
predictionsTestData_c = models_dict['label_casual'].transform(selector['label_casual'].transform(test_data))
predictionsTestData_count = models_dict['label_count'].transform(selector['label_count'].transform(test_data))
return predictionsTestData_r, predictionsTestData_c, predictionsTestData_count
Function for evaluation metrics; it supports evaluating both rounded and unrounded predictions.
def evaluate_prediction(predictionsTestData_r, predictionsTestData_c, predictionsTestData_count, lambdas, rounded=True):
print("Evaluation prediction for registered users:")
pandas_pred_r = predictionsTestData_r.toPandas()
pandas_pred_r.loc[(pandas_pred_r['prediction'] < 0), 'prediction'] = 0
rmsle_r = evaluateMetrics(pandas_pred_r, 'label_registered', lmbd=lambdas['registered'], rounded=rounded)
print()
print("Evaluation prediction for casual users:")
pandas_pred_c = predictionsTestData_c.toPandas()
pandas_pred_c.loc[(pandas_pred_c['prediction'] < 0), 'prediction'] = 0
rmsle_c = evaluateMetrics(pandas_pred_c, 'label_casual', lmbd=lambdas['casual'], rounded=rounded)
print()
print("Evaluation prediction for sum of both models: registered + casual users:")
pandas_pred_c_undo_log = undo_bc_transform(pandas_pred_c,'label_casual', lmbd=lambdas['casual'])
pandas_pred_c_undo_log.loc[(pandas_pred_c_undo_log['prediction'] < 0), 'prediction'] = 0
pandas_pred_r_undo_log = undo_bc_transform(pandas_pred_r,'label_registered', lmbd=lambdas['registered'])
pandas_pred_r_undo_log.loc[(pandas_pred_r_undo_log['prediction'] < 0), 'prediction'] = 0
pandas_pred_sum = pd.DataFrame()
pandas_pred_sum['label_count'] = pandas_pred_c_undo_log['label_casual'] + pandas_pred_r_undo_log['label_registered']
pandas_pred_sum['prediction'] = pandas_pred_c_undo_log['prediction'] + pandas_pred_r_undo_log['prediction']
rmsle_sum = evaluateMetrics(pandas_pred_sum, 'label_count', lmbd={}, rollback=False, rounded=rounded)
print()
print("Evaluation prediction for one count users model:")
pandas_pred_count = predictionsTestData_count.toPandas()
pandas_pred_count.loc[(pandas_pred_count['prediction'] < 0), 'prediction'] = 0
rmsle_count = evaluateMetrics(pandas_pred_count, 'label_count', lmbd=lambdas['count'], rounded=rounded)
# We drop the features column so we can mix predictions from dummy and no_dummy models;
# the dummy features vector has many more positions than the no_dummy one, so joining both would be a problem.
# We won't need the features vector any more.
prediction_dict = {'label_registered' : undo_bc_transform(pandas_pred_r.drop(['features'], axis = 1).copy(), 'label_registered',lmbd=lambdas['registered']),
'label_casual': undo_bc_transform(pandas_pred_c.drop(['features'], axis = 1).copy(), 'label_casual',lmbd=lambdas['casual']),
'label_sum': pandas_pred_sum.copy(),
'label_count': undo_bc_transform(pandas_pred_count.drop(['features'], axis = 1).copy(),'label_count', lmbd=lambdas['count']),
'rmsle_r': rmsle_r,
'rmsle_c': rmsle_c,
'rmsle_sum': rmsle_sum,
'rmsle_count': rmsle_count}
return prediction_dict
def evaluate_all_prediction(all_registered, all_casual, all_sum, all_count, rounded=True):
print("Evaluation prediction for registered users:")
rmsle_r = evaluateMetrics(all_registered, 'label_registered', lmbd={}, rollback=False, rounded=rounded)
print()
print("Evaluation prediction for casual users:")
rmsle_c = evaluateMetrics(all_casual, 'label_casual', lmbd={}, rollback=False, rounded=rounded)
print()
print("Evaluation prediction for sum of both models: registered + casual users:")
rmsle_sum = evaluateMetrics(all_sum, 'label_count', lmbd={}, rollback=False, rounded=rounded)
print()
print("Evaluation prediction for one count users model:")
rmsle_count = evaluateMetrics(all_count, 'label_count', lmbd={}, rollback=False, rounded=rounded)
prediction_dict = {'rmsle_r': rmsle_r,
'rmsle_c': rmsle_c,
'rmsle_sum': rmsle_sum,
'rmsle_count': rmsle_count}
return prediction_dict
Function for mixing predictions from two different models with a specific ratio.
def evaluate_mixed_prediction(rf_all_sum, rf_all_count, gbt_all_sum, gbt_all_count, ratio=0.5, rounded=True):
print("Evaluation mixed (ratio=" + str(ratio) + ") prediction for sum of both models: registered + casual users:")
mixed_pandas_sum = ratio * rf_all_sum[['label_count', 'prediction']] + (1.0 - ratio) * gbt_all_sum[['label_count', 'prediction']]
rmsle_sum = evaluateMetrics(mixed_pandas_sum, 'label_count', lmbd={}, rollback=False, rounded=rounded)
print()
print("Evaluation mixed (ratio=" + str(ratio) + ") prediction for one count users model:")
mixed_pandas_count = ratio * rf_all_count[['label_count', 'prediction']] + (1.0 - ratio) * gbt_all_count[['label_count', 'prediction']]
rmsle_count = evaluateMetrics(mixed_pandas_count, 'label_count', lmbd={}, rollback=False, rounded=rounded)
mixed_prediction_dict = {'sum': mixed_pandas_sum.copy(),
'count': mixed_pandas_count.copy(),
'rmsle_sum' : rmsle_sum,
'rmsle_count' : rmsle_count}
return mixed_prediction_dict
Function for the final prediction on the Kaggle test dataset. It supports mixing predictions with a specific ratio. The final sum of predictions is rounded.
def predict_Kaggle_test(rf_predictionsTestData_r, rf_predictionsTestData_c, \
gbtr_predictionsTestData_r, gbtr_predictionsTestData_c, lambdas, ratio=0.5):
# 1. Although the Kaggle test dataset does not have labels, all predictions are Spark DataFrames
# with schema ["label_count", "label_registered", "label_casual", "features"]
# to keep the format coherent with the training data, which lets us use the Spark featureIndexer during model training
# featureIndexer = VectorIndexer(inputCol="features", outputCol="indexedFeatures", maxCategories=24).fit(trainingData_no_dummy.unionAll(testData_no_dummy).unionAll(testKaggle_no_dummy))
#
# 2. We use the 'label_count' column to conveniently cache the datetime variable used for the submission file format
# pandas_pred_sum.rename(columns={"label_count": "datetime"})
#
# 3. Predicted values have to be numerical
# 4. labelCol is set to '' because the Kaggle test dataset doesn't have one
rf_pandas_pred_r = rf_predictionsTestData_r.toPandas().drop(['features'], axis = 1).copy()
rf_pandas_pred_c = rf_predictionsTestData_c.toPandas().drop(['features'], axis = 1).copy()
gbtr_pandas_pred_r = gbtr_predictionsTestData_r.toPandas().drop(['features'], axis = 1).copy()
gbtr_pandas_pred_c = gbtr_predictionsTestData_c.toPandas().drop(['features'], axis = 1).copy()
rf_pandas_pred_r.loc[(rf_pandas_pred_r['prediction'] < 0), 'prediction'] = 0
rf_pandas_pred_c.loc[(rf_pandas_pred_c['prediction'] < 0), 'prediction'] = 0
gbtr_pandas_pred_r.loc[(gbtr_pandas_pred_r['prediction'] < 0), 'prediction'] = 0
gbtr_pandas_pred_c.loc[(gbtr_pandas_pred_c['prediction'] < 0), 'prediction'] = 0
rf_pandas_pred_r = undo_bc_transform(rf_pandas_pred_r, labelCol='', lmbd=lambdas['registered'], Kaggle=True)
rf_pandas_pred_c = undo_bc_transform(rf_pandas_pred_c, labelCol='', lmbd=lambdas['casual'], Kaggle=True)
gbtr_pandas_pred_r = undo_bc_transform(gbtr_pandas_pred_r, labelCol='', lmbd=lambdas['registered'], Kaggle=True)
gbtr_pandas_pred_c = undo_bc_transform(gbtr_pandas_pred_c, labelCol='', lmbd=lambdas['casual'], Kaggle=True)
pandas_pred_sum = rf_pandas_pred_r.copy()
pandas_pred_sum['count'] = 0.0
pandas_pred_sum['count'] = ratio * (rf_pandas_pred_r['prediction'] + rf_pandas_pred_c['prediction']) + \
(1.0 - ratio) * (gbtr_pandas_pred_r['prediction'] + gbtr_pandas_pred_c['prediction'])
pandas_pred_sum['count'] = np.round(pandas_pred_sum['count']).astype(int)
# other option to truncate floating point number instead of round
# pandas_pred_sum['count'] = pandas_pred_sum['count'].astype(int)
return pandas_pred_sum.rename(columns={"label_count": "datetime"})[['datetime','count']]
Function to compute the mixed score for COUNT. Later on we will use the better of the two predictions: SUM (casual + registered modeled separately) or COUNT. Unfortunately, we can choose the better one only by comparing results on the Kaggle website. :-)
def predict_count_Kaggle_test(rf_predictionsTestData_c, gbtr_predictionsTestData_c, lambdas, ratio=0.5):
rf_pandas_pred_c = rf_predictionsTestData_c.toPandas().drop(['features'], axis = 1).copy()
gbtr_pandas_pred_c = gbtr_predictionsTestData_c.toPandas().drop(['features'], axis = 1).copy()
rf_pandas_pred_c.loc[(rf_pandas_pred_c['prediction'] < 0), 'prediction'] = 0
gbtr_pandas_pred_c.loc[(gbtr_pandas_pred_c['prediction'] < 0), 'prediction'] = 0
rf_pandas_pred_c = undo_bc_transform(rf_pandas_pred_c, labelCol='', lmbd=lambdas['count'], Kaggle=True)
gbtr_pandas_pred_c = undo_bc_transform(gbtr_pandas_pred_c, labelCol='', lmbd=lambdas['count'], Kaggle=True)
pandas_pred_sum = rf_pandas_pred_c.copy()
pandas_pred_sum['count'] = 0.0
pandas_pred_sum['count'] = ratio * (rf_pandas_pred_c['prediction']) + \
(1.0 - ratio) * (gbtr_pandas_pred_c['prediction'])
pandas_pred_sum['count'] = np.round(pandas_pred_sum['count']).astype(int)
# other option to truncate floating point number instead of round
# pandas_pred_sum['count'] = pandas_pred_sum['count'].astype(int)
return pandas_pred_sum.rename(columns={"label_count": "datetime"})[['datetime','count']]
Concatenate the predictions for all hour_cat values into one pandas DataFrame.
def concatenate_Kaggle_predictions(pred_sum_dict, pred_count_dict):
# initialize with data from first hour_cat and then append rest hour categories
all_sum = pred_sum_dict[0].copy()
all_count = pred_count_dict[0].copy()
# start from hour_cat = 1
for hour_cat in hour_cat_list[1:]:
all_sum = all_sum.append(pred_sum_dict[hour_cat])
all_count = all_count.append(pred_count_dict[hour_cat])
return all_sum.reset_index(drop=True), all_count.reset_index(drop=True)
def concatenate_predictions(pred_dict):
# initialize with data from first hour_cat and then append rest hour categories
all_registered = pred_dict[0]['label_registered'].copy()
all_casual = pred_dict[0]['label_casual'].copy()
all_sum = pred_dict[0]['label_sum'].copy()
all_count = pred_dict[0]['label_count'].copy()
# start from hour_cat = 1
for hour_cat in hour_cat_list[1:]:
all_registered = all_registered.append(pred_dict[hour_cat]['label_registered'])
all_casual = all_casual.append(pred_dict[hour_cat]['label_casual'])
all_sum = all_sum.append(pred_dict[hour_cat]['label_sum'])
all_count = all_count.append(pred_dict[hour_cat]['label_count'])
return all_registered.reset_index(drop=True), all_casual.reset_index(drop=True), all_sum.reset_index(drop=True), all_count.reset_index(drop=True)
Decision trees are a popular family of classification and regression methods. We will use them as a step before training the ensemble models, to verify which subsets of features minimize the loss function.
Let's find out how the number of chosen features impacts the RMSLE metric.
dt = DecisionTreeRegressor(maxDepth=30, minInstancesPerNode=10, featuresCol="features")
dt_cv_models_dict = {}
selector_no_dummy_dict = {}
rmsle_hour_cat = {}
The code below is commented out because it's time-consuming. Instead, we load previously computed results with pickle.
# for hour_cat in hour_cat_list:
# rmsle_registered = []
# rmsle_casual = []
# rmsle_sum = []
# rmsle_count = []
# for num in range(1,len(no_dummy_cols) + 1):
# for label in ["label_count", "label_registered", "label_casual"]:
# selector_no_dummy = ChiSqSelector(selectorType="numTopFeatures", fpr=0.01, numTopFeatures=num, \
# featuresCol="indexedFeatures",outputCol="features", labelCol=label + "_cat")
# selected_features_no_dummy = selector_no_dummy.fit(featureIndexer \
# .transform(chiSq_no_dummy.filter(chiSq_no_dummy.hour_cat == hour_cat)))
# pipeline = Pipeline(stages=[featureIndexer, selected_features_no_dummy, dt.setLabelCol(label)])
# model = pipeline.fit(trainingData_no_dummy.filter(trainingData_no_dummy.hour_cat == hour_cat))
# dt_cv_models_dict[label] = model
# selector_no_dummy_dict[label] = selected_features_no_dummy
# print("\n==========================================")
# print("Category: " + str(hour_cat) + ", using " + str(num) + " best features")
# print("==========================================")
# dt_predictionsTestData_r, dt_predictionsTestData_c, dt_predictionsTestData_count = make_prediction(dt_cv_models_dict, selector_no_dummy_dict, hour_cat, dummy=False)
# dt_pred_dict = evaluate_prediction(dt_predictionsTestData_r, dt_predictionsTestData_c, dt_predictionsTestData_count, lambdas=bc_lambda_dict[hour_cat])
# rmsle_registered.append(dt_pred_dict['rmsle_r'])
# rmsle_casual.append(dt_pred_dict['rmsle_c'])
# rmsle_sum.append(dt_pred_dict['rmsle_sum'])
# rmsle_count.append(dt_pred_dict['rmsle_count'])
# rmsle_hour_cat[hour_cat] = {'registered' : rmsle_registered,
# 'casual' : rmsle_casual,
# 'sum': rmsle_sum,
# 'count' : rmsle_count}
Load the previously computed metric scores.
#pickle.dump(rmsle_hour_cat, open( "../data/rmsle_hour_cat.p", "wb" ))
rmsle_hour_cat = pickle.load(open("../data/rmsle_hour_cat.p", "rb"))
From the plots below it's clear that a different number of features (those with the smallest p-values) does the job for each hour category. Increasing the number of features above a certain level doesn't improve the metric and raises the risk of overfitting. The 'hour' feature has the biggest impact on minimizing the final score.
for hour_cat in hour_cat_list:
fig = plt.figure(figsize=(12,4))
plt.plot(range(1,len(no_dummy_cols) + 1), rmsle_hour_cat[hour_cat]['registered'], label='Registered')
plt.plot(range(1,len(no_dummy_cols) + 1), rmsle_hour_cat[hour_cat]['casual'] , label='Casual')
plt.plot(range(1,len(no_dummy_cols) + 1), rmsle_hour_cat[hour_cat]['sum'], label='Sum', c='red', marker='o', markerfacecolor='None')
plt.plot(range(1,len(no_dummy_cols) + 1), rmsle_hour_cat[hour_cat]['count'], label='Count')
plt.xlabel("Number of sorted features by p-Value (ascending)")
plt.ylabel("RMSLE")
plt.title("hour_cat " + str(hour_cat) + ": Trade off - number of features used by model")
plt.legend()
From the plots above we can set the number of features for each hour_cat separately to minimize the Sum RMSLE. In some cases using all features would minimize the RMSLE best, but those models are prone to overfitting, so we cap the number of chosen features. We also set a minimum number of features, which is useful for random forest regression.
min_num_features = 12
max_num_features = 17
feat_num_hour_cat = {}
for hour_cat in hour_cat_list:
feat_num_hour_cat[hour_cat] = {'label_registered' : rmsle_hour_cat[hour_cat]['registered'][:max_num_features]\
.index(min(rmsle_hour_cat[hour_cat]['registered'][:max_num_features])) + 1,
'label_casual' : rmsle_hour_cat[hour_cat]['casual'][:max_num_features]\
.index(min(rmsle_hour_cat[hour_cat]['casual'][:max_num_features])) + 1,
'label_count' : rmsle_hour_cat[hour_cat]['count'][:max_num_features]\
.index(min(rmsle_hour_cat[hour_cat]['count'][:max_num_features])) + 1}
for hour_cat in hour_cat_list:
for label in ['label_registered', 'label_casual', 'label_count']:
if (feat_num_hour_cat[hour_cat][label] < min_num_features): feat_num_hour_cat[hour_cat][label] = min_num_features
feat_num_hour_cat
Prepare a ChiSqSelector ( https://en.wikipedia.org/wiki/Chi-squared_test ) to choose the best features for our models based on feat_num_hour_cat. For the dummy option, select all features with a p-value below 0.05.
feature_selector_dummy = {}
feature_selector_no_dummy = {}
for hour_cat in hour_cat_list:
    feat_selector_no_dummy = {}
    feat_selector_dummy = {}
    for label in ["label_count", "label_registered", "label_casual"]:
        # for the dummy option select features with p-Value below 0.05
        selector_dummy = ChiSqSelector(selectorType="fpr", percentile=0.5, fpr=0.05, featuresCol="features_to_filter",
                                       outputCol="features", labelCol=label + "_cat")
        # we want to chain the selector with the feature indexer, so we must fit the
        # selector to indexed data before we can transform indexed data
        selector_no_dummy = ChiSqSelector(selectorType="numTopFeatures", fpr=0.05,
                                          numTopFeatures=feat_num_hour_cat[hour_cat][label],
                                          featuresCol="indexedFeatures",
                                          outputCol="features", labelCol=label + "_cat")
        selected_features_no_dummy = selector_no_dummy.fit(
            featureIndexer.transform(chiSq_no_dummy.filter(chiSq_no_dummy.hour_cat == hour_cat)))
        selected_features_dummy = selector_dummy.fit(chiSq_dummy.filter(chiSq_dummy.hour_cat == hour_cat))
        feat_selector_no_dummy[label] = selected_features_no_dummy
        feat_selector_dummy[label] = selected_features_dummy
        # print selected columns for the no_dummy option
        print(label + "_no_dummy, cat_" + str(hour_cat) + ": ChiSqSelector output with %d features selected below %.2f p-Value"
              % (len(selected_features_no_dummy.selectedFeatures), selector_no_dummy.getFpr()))
        for ind in selected_features_no_dummy.selectedFeatures:
            print("\t- " + no_dummy_cols[ind])
        print("")
    feature_selector_no_dummy[hour_cat] = feat_selector_no_dummy
    feature_selector_dummy[hour_cat] = feat_selector_dummy
Verify the final feature selection results on a Decision Tree model.
# Train a DecisionTree model.
dt = DecisionTreeRegressor(maxDepth=30, minInstancesPerNode=10, featuresCol="features")
dt_cv_models_dict = {}
for hour_cat in hour_cat_list:
    dt_cv_models = {}
    for label in ['label_registered', 'label_casual', 'label_count']:
        pipeline = Pipeline(stages=[featureIndexer, feature_selector_no_dummy[hour_cat][label], dt.setLabelCol(label)])
        model = pipeline.fit(trainingData_no_dummy.filter(trainingData_no_dummy.hour_cat == hour_cat))
        dt_cv_models[label] = model
        print("Train for hour_cat " + str(hour_cat) + " and " + label + " done.")
    dt_cv_models_dict[hour_cat] = dt_cv_models
dt_pred_dict = {}
for hour_cat in hour_cat_list:
    print("\n==========================================")
    print("Category: " + str(hour_cat))
    print("==========================================")
    dt_predictionsTestData_r, dt_predictionsTestData_c, dt_predictionsTestData_count = make_prediction(dt_cv_models_dict[hour_cat], feature_selector_no_dummy[hour_cat], hour_cat, dummy=False)
    dt_pred_dict[hour_cat] = evaluate_prediction(dt_predictionsTestData_r, dt_predictionsTestData_c, dt_predictionsTestData_count, lambdas=bc_lambda_dict[hour_cat])
dt_all_registered, dt_all_casual, dt_all_sum, dt_all_count = concatenate_predictions(dt_pred_dict)
dt_all_evaluated = evaluate_all_prediction(dt_all_registered, dt_all_casual, dt_all_sum, dt_all_count)
We got an RMSLE of 0.3914 with a simple decision tree. Looks promising.
We now treat the Pipeline as an Estimator, wrapping it in a TrainValidationSplit instance. This allows us to jointly choose parameters for all Pipeline stages. Like CrossValidator, TrainValidationSplit requires an Estimator, a set of Estimator ParamMaps, and an Evaluator. We use a ParamGridBuilder to construct a grid of parameters to search over.
| regularizer | $R(w)$ | gradient or sub-gradient |
|---|---|---|
| zero (unregularized) | $0$ | $\mathbf{0}$ |
| L2 | $\frac{1}{2}\|w\|_2^2$ | $w$ |
| L1 | $\|w\|_1$ | $\mathrm{sign}(w)$ |
| elastic net | $\alpha \|w\|_1 + (1-\alpha)\frac{1}{2}\|w\|_2^2$ | $\alpha\,\mathrm{sign}(w) + (1-\alpha) w$ |
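The elastic-net column in the table above can be checked numerically; a minimal NumPy sketch (illustration only, not part of the Spark pipeline) showing that the sub-gradient reduces to the pure L2 and L1 cases at the endpoints $\alpha = 0$ and $\alpha = 1$:

```python
import numpy as np

def elastic_net_subgradient(w, alpha):
    """Sub-gradient of alpha*||w||_1 + (1-alpha)*0.5*||w||_2^2 (see table above)."""
    return alpha * np.sign(w) + (1 - alpha) * w

w = np.array([0.5, -2.0, 3.0])
# alpha=0 -> pure L2 gradient (w itself); alpha=1 -> pure L1 sub-gradient (sign(w))
assert np.allclose(elastic_net_subgradient(w, 0.0), w)
assert np.allclose(elastic_net_subgradient(w, 1.0), np.sign(w))
```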
lr_cv_models_dict = {}
lr = LinearRegression()
regParam = [1.0, 0.5, 0.1, 0.01, 0.001]
elasticNetParam = [0, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
for hour_cat in hour_cat_list:
    lr_cv_models = {}
    for label in ['label_registered', 'label_casual', 'label_count']:
        pipeline = Pipeline(stages=[lr])
        paramGrid = ParamGridBuilder() \
            .addGrid(lr.regParam, regParam) \
            .addGrid(lr.elasticNetParam, elasticNetParam) \
            .addGrid(lr.labelCol, [label]) \
            .build()
        crossval = TrainValidationSplit(estimator=pipeline,
                                        estimatorParamMaps=paramGrid,
                                        evaluator=RegressionEvaluator(labelCol=label, metricName="rmse"),
                                        trainRatio=0.75)  # (5x10) = 50 models to check
        start = time.time()
        # Run validation and choose the best set of parameters on the dummy dataset.
        # trainingData_dummy + testData_dummy = the whole Kaggle train data; we separated these
        # sets earlier to simply evaluate models. Here we validate on the whole Kaggle train data
        # to find the best params for predicting the whole Kaggle test set,
        # using only the features chosen before by ChiSqSelector.
        cvModel = crossval.fit(feature_selector_dummy[hour_cat][label].transform(
            trainingData_dummy.filter(trainingData_dummy.hour_cat == hour_cat)
            .unionAll(testData_dummy.filter(testData_dummy.hour_cat == hour_cat))))
        end = time.time()
        print("======= CV for hour_cat " + str(hour_cat) + " and " + label + " =========")
        print("Time taken to train model with validation split: " + str(end - start) + " seconds")
        bestModel = cvModel.bestModel.stages[0]._java_obj
        print("RegParam: " + str(bestModel.getRegParam()))
        print("ElasticNetParam: " + str(bestModel.getElasticNetParam()))
        lr_cv_models[label] = cvModel
        print()
    lr_cv_models_dict[hour_cat] = lr_cv_models
Now that we have fit our model, let's evaluate its performance by predicting the test values with the best parameters from validation.
lr_pred_dict = {}
for hour_cat in hour_cat_list:
    print("\n==========================================")
    print("Category: " + str(hour_cat))
    print("==========================================")
    lr_predictionsTestData_r, lr_predictionsTestData_c, lr_predictionsTestData_count = make_prediction(lr_cv_models_dict[hour_cat], feature_selector_dummy[hour_cat], hour_cat, dummy=True)
    lr_pred_dict[hour_cat] = evaluate_prediction(lr_predictionsTestData_r, lr_predictionsTestData_c, lr_predictionsTestData_count, lambdas=bc_lambda_dict[hour_cat])
Let's evaluate our model performance by calculating the evaluation metrics. We achieved a better result: 0.3454 vs 0.3914 from the simple decision tree.
lr_all_registered, lr_all_casual, lr_all_sum, lr_all_count = concatenate_predictions(lr_pred_dict)
lr_all_evaluated = evaluate_all_prediction(lr_all_registered, lr_all_casual, lr_all_sum, lr_all_count)
Create a scatterplot of the real test values versus the predicted values.
Let's also quickly explore the residuals to make sure everything is okay with our data. It's a good idea to plot a histogram of the residuals and check that it looks normally distributed.
fig,axes= plt.subplots(nrows=1, ncols=2, figsize=(12,5))
axes[0].scatter(lr_all_sum['label_count'],lr_all_sum['prediction'])
axes[0].set_xlabel('Y Test')
axes[0].set_ylabel('Predicted Y')
axes[0].set_title('Linear Regression')
#sns.distplot((lr_all_sum['label_count'] - lr_all_sum['prediction']),bins=64, ax=axes[1]);
axes[1].hist(lr_all_sum['label_count'] - lr_all_sum['prediction'],bins=64)
axes[1].set_xlabel('Linear Regression Residuals')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout()
Random forests are ensembles of decision trees. Random forests are one of the most successful machine learning models for classification and regression. They combine many decision trees in order to reduce the risk of overfitting. Like decision trees, random forests handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.
Random forests train a set of decision trees separately, so the training can be done in parallel. The algorithm injects randomness into the training process so that each decision tree is a bit different. Combining the predictions from each tree reduces the variance of the predictions, improving the performance on test data.
The randomness injected into the training process includes:
- subsampling the original dataset on each iteration to get a different training set (bootstrapping),
- considering different random subsets of features to split on at each tree node.
Apart from these randomizations, decision tree training is done in the same way as for individual decision trees. To make a prediction on a new instance, a random forest must aggregate the predictions from its set of decision trees. This aggregation is done differently for classification and regression. For regression, each tree predicts a real value, and the label is predicted to be the average of the tree predictions.
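The variance-reduction effect of averaging can be illustrated with a toy NumPy sketch (synthetic numbers, not the competition data): many noisy "tree" predictions are averaged, and the ensemble mean varies far less than a single tree's prediction.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 10.0
n_trees, n_repeats = 100, 2000

# each "tree" predicts the true value plus independent noise (std = 2.0)
single_tree = true_value + 2.0 * rng.standard_normal(n_repeats)
# a "forest" averages 100 such noisy predictions
forest_mean = true_value + 2.0 * rng.standard_normal((n_repeats, n_trees)).mean(axis=1)

# averaging reduces the variance of the prediction
assert forest_mean.var() < single_tree.var()
```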
rf_cv_models_dict = {}
rf = RandomForestRegressor()
numTrees = [500, 1000]
maxDepth = [10, 15, 30]
minInstancesPerNode = [10]
maxBins = [32]
for hour_cat in hour_cat_list:
    rf_cv_models = {}
    for label in ['label_registered', 'label_casual', 'label_count']:
        pipeline = Pipeline(stages=[rf])
        paramGrid = ParamGridBuilder() \
            .addGrid(rf.numTrees, numTrees) \
            .addGrid(rf.maxDepth, maxDepth) \
            .addGrid(rf.minInstancesPerNode, minInstancesPerNode) \
            .addGrid(rf.maxBins, maxBins) \
            .addGrid(rf.labelCol, [label]) \
            .build()
        crossval = TrainValidationSplit(estimator=pipeline,
                                        estimatorParamMaps=paramGrid,
                                        evaluator=RegressionEvaluator(labelCol=label, metricName="rmse"),
                                        trainRatio=0.75)  # (2x3) = 6 models to check
        start = time.time()
        cvModel = crossval.fit(feature_selector_no_dummy[hour_cat][label].transform(
            featureIndexer.transform(trainingData_no_dummy.filter(trainingData_no_dummy.hour_cat == hour_cat)
                                     .unionAll(testData_no_dummy.filter(testData_no_dummy.hour_cat == hour_cat)))))
        print("======= CV for hour_cat " + str(hour_cat) + " and " + label + " =========")
        end = time.time()
        print("Time taken to train model with validation split: " + str(end - start) + " seconds")
        bestModel = cvModel.bestModel.stages[0]._java_obj
        print("NumTrees: " + str(bestModel.getNumTrees()))
        print("MaxDepth: " + str(bestModel.getMaxDepth()))
        print("MinInstancesPerNode: " + str(bestModel.getMinInstancesPerNode()))
        print("MaxBins: " + str(bestModel.getMaxBins()))
        rf_cv_models[label] = cvModel
        print()
    rf_cv_models_dict[hour_cat] = rf_cv_models
# rf = RandomForestRegressor(numTrees=500, maxDepth=15, minInstancesPerNode=10, featuresCol="features")
# for hour_cat in hour_cat_list:
# rf_cv_models = {}
# for label in ['label_registered','label_casual', 'label_count']:
# pipeline = Pipeline(stages=[featureIndexer, feature_selector_no_dummy[hour_cat][label], rf.setLabelCol(label)])
# model = pipeline.fit(trainingData_no_dummy.filter(trainingData_no_dummy.hour_cat == hour_cat))
# rf_cv_models[label] = model
# print("Train for hour_cat " + str(hour_cat) + " and " + label + " done.")
# rf_cv_models_dict[hour_cat] = rf_cv_models
Evaluate results in each hour_cat separately.
rf_pred_dict = {}
for hour_cat in hour_cat_list:
    print("\n==========================================")
    print("Category: " + str(hour_cat))
    print("==========================================")
    rf_predictionsTestData_r, rf_predictionsTestData_c, rf_predictionsTestData_count = make_prediction(rf_cv_models_dict[hour_cat], feature_selector_no_dummy[hour_cat], hour_cat, dummy=False)
    rf_pred_dict[hour_cat] = evaluate_prediction(rf_predictionsTestData_r, rf_predictionsTestData_c, rf_predictionsTestData_count, lambdas=bc_lambda_dict[hour_cat])
Evaluate the concatenated results. We achieved a better result: 0.3086 vs 0.3509 from the cross-validated linear regression. Looks even better.
rf_all_registered, rf_all_casual, rf_all_sum, rf_all_count = concatenate_predictions(rf_pred_dict)
rf_all_evaluated = evaluate_all_prediction(rf_all_registered, rf_all_casual, rf_all_sum, rf_all_count)
Residuals should have normal distribution. Looks good.
fig,axes= plt.subplots(nrows=1, ncols=2, figsize=(12,5))
axes[0].scatter(rf_all_sum['label_count'],rf_all_sum['prediction'])
axes[0].set_xlabel('Y Test')
axes[0].set_ylabel('Predicted Y')
axes[0].set_title('Random Forest Regression')
axes[1].hist(rf_all_sum['label_count'] - rf_all_sum['prediction'],bins=64)
axes[1].set_xlabel('Random Forest Regression Residuals')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout()
Gradient-Boosted Trees (GBTs) are ensembles of decision trees. GBTs iteratively train decision trees in order to minimize a loss function. Like decision trees, GBTs handle categorical features, extend to the multiclass classification setting, do not require feature scaling, and are able to capture non-linearities and feature interactions.
Both Gradient-Boosted Trees (GBTs) and Random Forests are algorithms for learning ensembles of trees, but the training processes are different. There are several practical trade-offs:
In short, both algorithms can be effective, and the choice should be based on the particular dataset.
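The boosting idea described above can be sketched in a few lines of NumPy (a toy illustration, not Spark's GBTRegressor: here the "tree" at each stage is just a constant fit to the current residuals, i.e. the gradient of the squared loss):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=50.0, scale=5.0, size=200)  # toy targets

prediction = np.zeros_like(y)
learning_rate = 0.5
for _ in range(30):                   # boosting iterations (cf. maxIter)
    residuals = y - prediction        # negative gradient of the squared loss
    stump = residuals.mean()          # trivial "tree": one constant fit to residuals
    prediction = prediction + learning_rate * stump

# the ensemble converges toward the mean of y
assert abs(prediction.mean() - y.mean()) < 1e-3
```

A real GBT fits a full regression tree to the residuals at each stage, which is why training is inherently sequential and slower than a random forest.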
gbt_cv_models_dict = {}
gbt = GBTRegressor()
maxDepth = [4]
maxIter = [30, 60, 90]
minInstancesPerNode = [10]
maxBins = [32]
for hour_cat in hour_cat_list:
    gbt_cv_models = {}
    for label in ['label_registered', 'label_casual', 'label_count']:
        pipeline = Pipeline(stages=[gbt])
        paramGrid = ParamGridBuilder() \
            .addGrid(gbt.minInstancesPerNode, minInstancesPerNode) \
            .addGrid(gbt.maxDepth, maxDepth) \
            .addGrid(gbt.maxIter, maxIter) \
            .addGrid(gbt.maxBins, maxBins) \
            .addGrid(gbt.labelCol, [label]) \
            .build()
        crossval = TrainValidationSplit(estimator=pipeline,
                                        estimatorParamMaps=paramGrid,
                                        evaluator=RegressionEvaluator(labelCol=label, metricName="rmse"),
                                        trainRatio=0.8)  # (1x3) = 3 models to check
        start = time.time()
        cvModel = crossval.fit(feature_selector_no_dummy[hour_cat][label].transform(
            featureIndexer.transform(trainingData_no_dummy.filter(trainingData_no_dummy.hour_cat == hour_cat)
                                     .unionAll(testData_no_dummy.filter(testData_no_dummy.hour_cat == hour_cat)))))
        print("======= CV for hour_cat " + str(hour_cat) + " and " + label + " =========")
        end = time.time()
        print("Time taken to train model with validation split: " + str(end - start) + " seconds")
        bestModel = cvModel.bestModel.stages[0]._java_obj
        print("MaxIter: " + str(bestModel.getMaxIter()))
        print("MaxDepth: " + str(bestModel.getMaxDepth()))
        print("MinInstancesPerNode: " + str(bestModel.getMinInstancesPerNode()))
        print("MaxBins: " + str(bestModel.getMaxBins()))
        gbt_cv_models[label] = cvModel
        print()
    gbt_cv_models_dict[hour_cat] = gbt_cv_models
gbt_cv_models_dict
Cross-validating GBTRegressor is very painful, so the alternative below trains all hour_cat with common params, probably losing some accuracy.
# gbt = GBTRegressor(maxDepth=4, minInstancesPerNode=10, maxIter=100, featuresCol="features")
# for hour_cat in hour_cat_list:
# gbt_cv_models = {}
# for label in ['label_registered','label_casual', 'label_count']:
# pipeline = Pipeline(stages=[featureIndexer, feature_selector_no_dummy[hour_cat][label], gbt.setLabelCol(label)])
# model = pipeline.fit(trainingData_no_dummy.filter(trainingData_no_dummy.hour_cat == hour_cat))
# gbt_cv_models[label] = model
# print("Train for hour_cat " + str(hour_cat) + " and " + label + " done.")
# gbt_cv_models_dict[hour_cat] = gbt_cv_models
Evaluate results in each hour_cat separately.
gbt_pred_dict = {}
for hour_cat in hour_cat_list:
    print("\n==========================================")
    print("Category: " + str(hour_cat))
    print("==========================================")
    gbt_predictionsTestData_r, gbt_predictionsTestData_c, gbt_predictionsTestData_count = make_prediction(gbt_cv_models_dict[hour_cat], feature_selector_no_dummy[hour_cat], hour_cat, dummy=False)
    gbt_pred_dict[hour_cat] = evaluate_prediction(gbt_predictionsTestData_r, gbt_predictionsTestData_c, gbt_predictionsTestData_count, lambdas=bc_lambda_dict[hour_cat])
Evaluate concatenated results.
gbt_all_registered, gbt_all_casual, gbt_all_sum, gbt_all_count = concatenate_predictions(gbt_pred_dict)
gbt_all_evaluated = evaluate_all_prediction(gbt_all_registered, gbt_all_casual, gbt_all_sum, gbt_all_count)
Visualize results.
fig,axes= plt.subplots(nrows=3, ncols=2, figsize=(11,7))
sns.distplot(gbt_all_registered['label_registered'], bins = 24, ax=axes[0][0])
sns.distplot(gbt_all_registered['prediction'], bins = 24, ax=axes[0][1], color='blue')
sns.distplot(gbt_all_casual['label_casual'], bins = 24, ax=axes[1][0])
sns.distplot(gbt_all_casual['prediction'], bins = 24, ax=axes[1][1], color='blue')
sns.distplot(gbt_all_count['label_count'], bins = 24, ax=axes[2][0])
sns.distplot(gbt_all_count['prediction'], bins = 24, ax=axes[2][1], color='blue')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout()
Hour categories 0 and 5 have the biggest RMSLE scores: 0.3315 and 0.3702, respectively.
fig, axes = plt.subplots(nrows=8, ncols=2, figsize=(11,19))
for hour_cat in hour_cat_list:
    sns.distplot(gbt_all_registered[gbt_all_registered['hour_cat'] == hour_cat]['label_registered'], bins=24, ax=axes[hour_cat][0], axlabel='HOUR_CAT ' + str(hour_cat) + ', registered', color='orange')
    sns.distplot(gbt_all_registered[gbt_all_registered['hour_cat'] == hour_cat]['prediction'], bins=24, ax=axes[hour_cat][1], color='green')
plt.subplots_adjust(hspace=0.3)
plt.tight_layout()
fig, axes = plt.subplots(nrows=8, ncols=2, figsize=(11,19))
for hour_cat in hour_cat_list:
    sns.distplot(gbt_all_casual[gbt_all_casual['hour_cat'] == hour_cat]['label_casual'], bins=24, ax=axes[hour_cat][0], axlabel='HOUR_CAT ' + str(hour_cat) + ', casual', color='red')
    sns.distplot(gbt_all_casual[gbt_all_casual['hour_cat'] == hour_cat]['prediction'], bins=24, ax=axes[hour_cat][1], color='green')
plt.subplots_adjust(hspace=0.3)
plt.tight_layout()
fig, axes = plt.subplots(nrows=8, ncols=2, figsize=(11,19))
for hour_cat in hour_cat_list:
    sns.distplot(gbt_all_count[gbt_all_count['hour_cat'] == hour_cat]['label_count'], bins=24, ax=axes[hour_cat][0], axlabel='HOUR_CAT ' + str(hour_cat) + ', count', color='blue')
    sns.distplot(gbt_all_count[gbt_all_count['hour_cat'] == hour_cat]['prediction'], bins=24, ax=axes[hour_cat][1], color='green')
plt.subplots_adjust(hspace=0.3)
plt.tight_layout()
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
axes[0].scatter(gbt_all_sum['label_count'], gbt_all_sum['prediction'])
axes[0].set_xlabel('Y Test')
axes[0].set_ylabel('Predicted Y')
axes[0].set_title('Gradient Boosted Tree Regression')
axes[1].hist(gbt_all_sum['label_count'] - gbt_all_sum['prediction'], bins=64)
axes[1].set_xlabel('Gradient Boosted Tree Regression Residuals')
plt.subplots_adjust(hspace=0.3)
plt.tight_layout()
We trained three machine learning algorithms, using validation-split search for parameter tuning. Of course we couldn't check all possible combinations due to computational limitations. Gradient Boosted Regression in particular was hard to train because of its iterative nature; on the other hand, it predicts very fast, which is one reason for its popularity (e.g. in ranking problems). As we might suspect, the non-linear models gave us better results. Gradient Boosted Regression won the game with a score of 0.2544!
RMSLE

| | registered | casual | registered + casual | count |
|---|---|---|---|---|
| Linear Regression | 0.3366 | 0.5566 | 0.3454 | 0.3474 |
| Random Forest Regression | 0.3024 | 0.4540 | 0.3086 | 0.3112 |
| Gradient Boosted Regression | 0.2568 | 0.3954 | 0.2544 | 0.2577 |
Training two separate models to predict casual and registered rentals gave us slightly better results. In all models, predictions for casual users were significantly worse than for registered users. This matches our understanding that there is no strict pattern to casual bike rentals, or that some other variable, not present in our dataset, could better model casual renting.
RMSLE can be used when we don't want to penalize huge absolute differences when both values are large; on the other hand, it's harder to compare such results with others. That's why we also measured the $R^2$ metric, which is standardized in a way.
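For reference, the two metrics can be computed as follows (a minimal sketch; the notebook's evaluate_* helpers presumably compute the equivalent on Spark DataFrames):

```python
import numpy as np

def rmsle(y_true, y_pred):
    """Root Mean Squared Logarithmic Error (the Kaggle competition metric)."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([10.0, 100.0, 1000.0])
assert rmsle(y_true, y_true) == 0.0       # perfect prediction
assert r_squared(y_true, y_true) == 1.0
```

Note how RMSLE works on log-scaled values, so an error of 10 on a count of 1000 costs far less than an error of 10 on a count of 20.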
$R^2$

| | registered | casual | registered + casual | count |
|---|---|---|---|---|
| Linear Regression | 0.9207 | 0.8718 | 0.9190 | 0.9214 |
| Random Forest Regression | 0.9405 | 0.9025 | 0.9396 | 0.9416 |
| Gradient Boosted Regression | 0.9533 | 0.9374 | 0.9567 | 0.9540 |
Depending on the $R^2$ value, we assume that the goodness of fit is:
It's worth noticing that combining the two models (for registered and casual users) yields better accuracy than either of them achieved separately.
RMSLE

| | registered | casual | registered + casual | count |
|---|---|---|---|---|
| Random Forest Regression | 0.2568 | 0.3954 | 0.2544 | 0.2577 |
So it's a good idea to combine predictions from two or three machine learning algorithms: Linear Regression, Random Forest Regression, and Gradient Boosted Regression.
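Conceptually, the combination is a convex mix of two models' predictions; a minimal sketch of the idea (the notebook's evaluate_mixed_prediction presumably does the equivalent on the real prediction frames):

```python
import numpy as np

def blend(pred_a, pred_b, ratio):
    """Convex combination: ratio * pred_a + (1 - ratio) * pred_b."""
    return ratio * pred_a + (1.0 - ratio) * pred_b

pred_a = np.array([100.0, 200.0])  # e.g. random forest predictions
pred_b = np.array([120.0, 180.0])  # e.g. gradient boosted tree predictions
assert np.allclose(blend(pred_a, pred_b, 1.0), pred_a)   # ratio=1 -> model A only
assert np.allclose(blend(pred_a, pred_b, 0.0), pred_b)   # ratio=0 -> model B only
```

Sweeping the ratio over a grid and evaluating RMSLE at each point, as done below, then locates the best mixing proportion.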
We plot RMSLE against the mixing ratio to find the best split proportion for both the SUM and COUNT predictions.
rmsle_sum_rf_gbt = []
rmsle_count_rf_gbt = []
rmsle_sum_lr_rf = []
rmsle_count_lr_rf = []
rmsle_sum_lr_gbt = []
rmsle_count_lr_gbt = []
ratio_array = np.linspace(0.05, 1, 20)
for ratio in ratio_array:
    mixed_rf_gbt = evaluate_mixed_prediction(rf_all_sum, rf_all_count, gbt_all_sum, gbt_all_count, ratio=float(ratio))
    mixed_lr_rf = evaluate_mixed_prediction(lr_all_sum, lr_all_count, rf_all_sum, rf_all_count, ratio=float(ratio))
    mixed_lr_gbt = evaluate_mixed_prediction(lr_all_sum, lr_all_count, gbt_all_sum, gbt_all_count, ratio=float(ratio))
    rmsle_sum_rf_gbt.append(mixed_rf_gbt['rmsle_sum'])
    rmsle_count_rf_gbt.append(mixed_rf_gbt['rmsle_count'])
    rmsle_sum_lr_rf.append(mixed_lr_rf['rmsle_sum'])
    rmsle_count_lr_rf.append(mixed_lr_rf['rmsle_count'])
    rmsle_sum_lr_gbt.append(mixed_lr_gbt['rmsle_sum'])
    rmsle_count_lr_gbt.append(mixed_lr_gbt['rmsle_count'])
We can see that a split proportion around 0.3 gives us the best result.
fig,axes= plt.subplots(nrows=3, ncols=2, figsize=(15,10))
axes[0][0].plot(ratio_array, rmsle_sum_rf_gbt, c='orange')
axes[0][0].set_xlabel('Ratio')
axes[0][0].set_ylabel('RMSLE')
axes[0][0].set_title('SUM: Ratio * RandomForest + (1 - Ratio) * GradientBoostedTree')
axes[0][1].plot(ratio_array, rmsle_count_rf_gbt, c='orange')
axes[0][1].set_xlabel('Ratio')
axes[0][1].set_ylabel('RMSLE')
axes[0][1].set_title('COUNT: Ratio * RandomForest + (1 - Ratio) * GradientBoostedTree')
axes[1][0].plot(ratio_array, rmsle_sum_lr_rf, c='green')
axes[1][0].set_xlabel('Ratio')
axes[1][0].set_ylabel('RMSLE')
axes[1][0].set_title('SUM: Ratio * LinearRegression + (1 - Ratio) * RandomForest')
axes[1][1].plot(ratio_array, rmsle_count_lr_rf, c='green')
axes[1][1].set_xlabel('Ratio')
axes[1][1].set_ylabel('RMSLE')
axes[1][1].set_title('COUNT: Ratio * LinearRegression + (1 - Ratio) * RandomForest')
axes[2][0].plot(ratio_array, rmsle_sum_lr_gbt, c='red')
axes[2][0].set_xlabel('Ratio')
axes[2][0].set_ylabel('RMSLE')
axes[2][0].set_title('SUM: Ratio * LinearRegression + (1 - Ratio) * GradientBoostedTree')
axes[2][1].plot(ratio_array, rmsle_count_lr_gbt, c='red')
axes[2][1].set_xlabel('Ratio')
axes[2][1].set_ylabel('RMSLE')
axes[2][1].set_title('COUNT: Ratio * LinearRegression + (1 - Ratio) * GradientBoostedTree')
plt.subplots_adjust(hspace = 0.3)
plt.tight_layout()
Get the best ratio split.
best_ratio = float(ratio_array[rmsle_sum_rf_gbt.index(min(rmsle_sum_rf_gbt))])
best_ratio
print('%.3f' % min(rmsle_sum_rf_gbt))
print('%.3f' % min(rmsle_count_rf_gbt))
Final best result on the 'testing' data:
Although these models achieved the best results on the validation set, they perform worse on the Kaggle test set. It seems Gradient Boosted Trees overfit the data.
# predicted_Kaggle_sum_cat = {}
# predicted_Kaggle_count_cat = {}
# for hour_cat in hour_cat_list:
# rf_predictionsTestData_r, rf_predictionsTestData_c, rf_predictionsTestData_count = make_prediction(rf_cv_models_dict[hour_cat], feature_selector_no_dummy[hour_cat], hour_cat, dummy=False, Kaggle=True)
# gbt_predictionsTestData_r, gbt_predictionsTestData_c, gbt_predictionsTestData_count = make_prediction(gbt_cv_models_dict[hour_cat], feature_selector_no_dummy[hour_cat], hour_cat, dummy=False, Kaggle=True)
# predicted_Kaggle_sum_cat[hour_cat] = predict_Kaggle_test(rf_predictionsTestData_r, rf_predictionsTestData_c, gbt_predictionsTestData_r, gbt_predictionsTestData_c, lambdas=bc_lambda_dict[hour_cat], ratio=best_ratio)
# predicted_Kaggle_count_cat[hour_cat] = predict_count_Kaggle_test(rf_predictionsTestData_count, gbt_predictionsTestData_count, lambdas=bc_lambda_dict[hour_cat], ratio=best_ratio)
predicted_Kaggle_sum_cat = {}
predicted_Kaggle_count_cat = {}
for hour_cat in hour_cat_list:
    rf_predictionsTestData_r, rf_predictionsTestData_c, rf_predictionsTestData_count = make_prediction(rf_cv_models_dict[hour_cat], feature_selector_no_dummy[hour_cat], hour_cat, dummy=False, Kaggle=True)
    lr_predictionsTestData_r, lr_predictionsTestData_c, lr_predictionsTestData_count = make_prediction(lr_cv_models_dict[hour_cat], feature_selector_dummy[hour_cat], hour_cat, dummy=True, Kaggle=True)
    predicted_Kaggle_sum_cat[hour_cat] = predict_Kaggle_test(rf_predictionsTestData_r, rf_predictionsTestData_c, lr_predictionsTestData_r, lr_predictionsTestData_c, lambdas=bc_lambda_dict[hour_cat], ratio=float(0.2))
    predicted_Kaggle_count_cat[hour_cat] = predict_count_Kaggle_test(rf_predictionsTestData_count, lr_predictionsTestData_count, lambdas=bc_lambda_dict[hour_cat], ratio=float(0.2))
all_sum_Kaggle, all_count_Kaggle = concatenate_Kaggle_predictions(predicted_Kaggle_sum_cat, predicted_Kaggle_count_cat)
all_sum_Kaggle['datetime'] = pd.to_datetime(all_sum_Kaggle['datetime'])
kaggleSubmission_sum = all_sum_Kaggle.sort_values("datetime").reset_index(drop=True)
all_count_Kaggle['datetime'] = pd.to_datetime(all_count_Kaggle['datetime'])
kaggleSubmission_count = all_count_Kaggle.sort_values("datetime").reset_index(drop=True)
kaggleSubmission_sum.head()
Kaggle result: 0.44472
kaggleSubmission_count.head()
Kaggle result: 0.42860
kaggleSubmission_sum.to_csv('../data/kaggleSubmission_sum.csv', index=False)
kaggleSubmission_count.to_csv('../data/kaggleSubmission_count.csv', index=False)
Due to computation and time limitations, there are areas where additional research could be done: